Introduction

1. Research Topic

This study focuses on Spotify Top Songs from 1956 to 2019 as the main analysis target. By analyzing the characteristics of highly popular songs across different eras, we aim to identify common traits of the most popular songs, which can help predict the potential of newly released songs to become hits.

2. Motivation

Every year, countless songs are released, and some become viral hits with millions of streams. Perhaps your song could be this year’s breakout hit.
By listening to past hit songs, we noticed that popular songs often share similar characteristics in style, lyrics, and rhythm. Songs with certain elements tend to achieve a certain level of popularity.
This led us to consider that analyzing high-traffic songs over the years could help identify the common elements of hit songs and predict the likelihood of a new song becoming popular.
Additionally, this analysis can help record labels allocate marketing budgets efficiently, directing resources to songs with the highest potential.

3. Research Methodology

We obtained the Spotify All Time Top Songs Mega Dataset from Kaggle. This study uses fourteen features for analysis: Title, Artist, Top Genre, Year, BPM, Energy, Danceability, Loudness (dB), Liveness, Valence, Duration, Acousticness, Speechiness, and Popularity.
Feature explanations:
> Title: Song title
> Artist: Artist name
> Top Genre: Song genre
> Year: Release year of the song
> Beats per Minute (BPM): Song tempo
> Energy: Song energy level, higher values indicate higher energy
> Danceability: Danceability of the song, higher values indicate easier to dance along
> Loudness: Song loudness in decibels, higher values mean louder under the same volume baseline
> Liveness: Higher values indicate a greater likelihood that the track was recorded live
> Valence: Positivity of the song, higher values indicate more positive mood
> Acousticness: Higher values indicate more acoustic elements in the song
> Speechiness: Higher values indicate more spoken word components
> Length: Song duration in minutes
> Popularity: Higher values indicate greater popularity, range from 0 to 100

To ensure clarity in analysis, data is segmented by decades from the 1950s to 2010s. Basic visualizations such as bar charts and scatter plots are used to analyze Top Genre, BPM, musical attributes (highlighted in red), and song duration.
We explore yearly trends in BPM and Loudness to examine if song tempo or loudness has increased over the years, reflecting changing public preferences.
Beyond the basic analysis, we also study correlations among musical attributes using a correlation heatmap, and examine the popularity of different genres based on musical features to understand listener preferences.
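The decade segmentation described above can be sketched in a few lines of base R (toy years used here, not the actual dataset):

```r
# Bucket release years into decades: 1965 -> 1960, 1978 -> 1970, 2015 -> 2010
years  <- c(1965, 1978, 2015)       # toy stand-in for spotify$Year
decade <- floor(years / 10) * 10
decade
```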

4. Extended Discussion and Conclusion

Knowing the characteristics of hit songs over the years allows us to create an interactive web tool. Users can input a song and get a prediction of its likelihood to become a hit based on its features.
Possible extensions:
1. Direct user interaction
2. Recommend potentially popular songs by genre
3. Incorporate other elements, e.g., whether the song contains explicit lyrics, and how that affects popularity

Music is an art form, but in the era of big data, public preferences can also be quantified. We can predict whether new songs have hit potential, helping in strategic planning for music production and marketing.

1. Basic Analysis

Data Overview

spotify
namefreq<-spotify %>% count(Artist)
names(namefreq) <- c("word","freq")
# my_graph<-wordcloud2(namefreq)
# saveWidget(my_graph,"tmp.html",selfcontained = F)
# webshot("tmp.html","fig_1.png", delay =5, vwidth = 1200, vheight=800)
knitr::include_graphics("fig_1.png")

From the raw data, we observe the frequency of appearances of each artist. Many familiar names appear! (Note: We pre-generated the wordcloud as it can cause rendering issues in later plots.)

Genre Statistics by Decade

1960~1970

sixties_songdata %>% group_by(Genre) %>% 
  summarize(Popularity=mean(Popularity))%>%
  arrange(desc(Popularity)) %>% subset(Popularity>60)%>%
  hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
  hc_xAxis(title = list(text = "Genre"))%>%
  hc_yAxis(title = list(text = "Average popularity by genre"))

The genres with average popularity above 60 show that Classic Soul was the most popular in the 1960s, but differences among genres were small.

sixties_songdata%>%
  count(Genre)%>%
  arrange(n)%>%
  hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))

Album rock was the most frequent genre, but its average popularity was not the highest, suggesting no direct correlation between quantity and popularity.

1970~1980

seventies_songdata %>% group_by(Genre) %>% 
  summarize(Popularity=mean(Popularity))%>%
  arrange(desc(Popularity)) %>% subset(Popularity>60)%>%
  hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
  hc_xAxis(title = list(text = "Genre"))%>%
  hc_yAxis(title = list(text = "Average popularity by genre"))

Classic Soul remains the genre with the highest average popularity in the 1970s.

seventies_songdata%>%
  count(Genre)%>%
  arrange(n)%>%
  hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))

In the 1970s, the proportion of album rock increased, gradually compressing the second-largest genre, adult standard. Comparing with the previous chart, we can see that classic soul has the highest average popularity overall, but there are only four songs in the data. Therefore, we examine the range of popularity for each genre.

seventies_songdata %>% group_by(Genre) %>% 
  summarize(Popularity=diff(range(Popularity)))%>%
  arrange(desc(Popularity)) %>%
  hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
  hc_xAxis(title = list(text = "Genre"))%>%
  hc_yAxis(title = list(text = "Popularity range by genre"))

The largest range is observed in album rock, suggesting that the large variation in popularity results in a lower average popularity for this genre.

1980~1990

eighties_songdata %>% group_by(Genre) %>% 
  summarize(Popularity=mean(Popularity))%>%
  arrange(desc(Popularity)) %>% subset(Popularity>62)%>%
  hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
  hc_xAxis(title = list(text = "Genre"))%>%
  hc_yAxis(title = list(text = "Average popularity by genre"))

In the 1980s, several new genres achieved high average popularity. Interestingly, many of the top genres can be classified as having “exotic traditional” styles, such as celtic punk from Ireland, along with emerging British and Canadian music styles.

eighties_songdata%>%
  count(Genre)%>%
  arrange(n)%>%
  hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))

The proportion of album rock decreased significantly in the 1980s, replaced by more exotic-style music, suggesting that the U.S. music market became increasingly culturally diverse.

1990~2000

nineties_songdata %>% group_by(Genre) %>% 
  summarize(Popularity=mean(Popularity))%>%
  arrange(desc(Popularity)) %>% subset(Popularity>62)%>%
  hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
  hc_xAxis(title = list(text = "Genre"))%>%
  hc_yAxis(title = list(text = "Average popularity by genre"))

In the 1990s, pop genres increased in popularity; the rise of pop punk coincided with the popularity of dance-hall culture, and rock and metal bands also gained ground.

nineties_songdata%>%
  count(Genre)%>%
  arrange(n)%>%
  hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))

Although album rock and alternative rock remained the main genres, their proportion gradually decreased, while pop genres gained popularity, consistent with the previous chart.

2000~2010

thousands_songdata %>% group_by(Genre) %>% 
  summarize(Popularity=mean(Popularity))%>%
  arrange(desc(Popularity)) %>% subset(Popularity>70)%>%
  hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
  hc_xAxis(title = list(text = "Genre"))%>%
  hc_yAxis(title = list(text = "Average popularity by genre"))

Besides the continuing rise of pop music, hip hop also became highly popular, with many subgenres appearing among the top popular categories.

thousands_songdata%>%
  count(Genre)%>%
  arrange(n)%>%
  hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))

Dance pop replaced album rock in terms of proportion, but rock music still held a significant share when grouped together. Hip hop, though lower in proportion, was very popular.

2010~2019

tens_songdata %>% group_by(Genre) %>% 
  summarize(Popularity=mean(Popularity))%>%
  arrange(desc(Popularity)) %>% subset(Popularity>70)%>%
  hchart(type="bar",hcaes(x=Genre, y=Popularity))%>%
  hc_xAxis(title = list(text = "Genre"))%>%
  hc_yAxis(title = list(text = "Average popularity by genre"))

After 2010, exotic-style music achieved high popularity. Listeners’ tastes became more diverse, but pop music remained dominant.

tens_songdata%>%
  count(Genre)%>%
  arrange(n)%>%
  hchart(type = "treemap", hcaes(x = Genre, value = n, color = n))

Relationships Between Music Features

Overall Observation

spotify_characteristic <- spotify %>% select(Popularity,Energy,Danceability,Loudness,Liveness,Valence,Acousticness,Speechiness)
names(spotify_characteristic)<-c("Popularity", "Energy", "Danceability","Loudness","Liveness","Valence","Acousticness","Speechiness")
GGally::ggcorr(data = spotify_characteristic, method = c("complete", "pearson"))

Popularity shows some positive correlation with Energy and Danceability, while the other features show little correlation with it. Energy and Loudness are highly correlated with each other, suggesting potential multicollinearity. We next examine songs with Popularity of 70 or above.
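The Energy–Loudness relationship noted above is easy to reproduce on synthetic data: when one variable largely drives the other, the Pearson coefficient lands near 1 (toy example, not the actual dataset):

```r
set.seed(42)
loudness <- rnorm(100)                           # synthetic loudness values
energy   <- 3 * loudness + rnorm(100, sd = 0.5)  # energy mostly driven by loudness
cor(energy, loudness)                            # close to 1
```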

Observation: Popularity ≥ 70

spotify_popularity70 <- subset(spotify, Popularity >= 70)

spotify.active70 <- spotify_popularity70 %>% 
  select(Popularity,Energy,Danceability,Loudness,
         Liveness,Valence,Acousticness,Speechiness)

GGally::ggcorr(data = spotify.active70, method = c("complete", "pearson"))

For songs with Popularity ≥ 70, only Loudness and Danceability retain a relatively high positive correlation; the other feature pairs show low or negligible correlation.

2. Build PCA Model

Regression Analysis of Song Popularity on Features

spotify1<-spotify[,-c(1:3)]
spotify_mc <- data.frame((spotify1))
lm_pop_to_char<-lm(Popularity~Year+Bpm+Energy+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_mc)
summary(lm_pop_to_char)
## 
## Call:
## lm(formula = Popularity ~ Year + Bpm + Energy + Danceability + 
##     Loudness + Liveness + Valence + Length + Acousticness + Speechiness, 
##     data = spotify_mc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.469  -9.165   1.158   9.723  37.842 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  553.833070  41.435305  13.366  < 2e-16 ***
## Year          -0.256877   0.020997 -12.234  < 2e-16 ***
## Bpm            0.002382   0.011090   0.215 0.829934    
## Energy        -0.100553   0.027529  -3.653 0.000266 ***
## Danceability   0.154640   0.024416   6.334 2.96e-10 ***
## Loudness       1.375388   0.134148  10.253  < 2e-16 ***
## Liveness      -0.091328   0.018534  -4.927 9.02e-07 ***
## Valence       -0.028175   0.017075  -1.650 0.099083 .  
## Length        -0.003547   0.004774  -0.743 0.457526    
## Acousticness  -0.028104   0.014154  -1.986 0.047219 *  
## Speechiness    0.337251   0.070584   4.778 1.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.37 on 1983 degrees of freedom
## Multiple R-squared:  0.1359, Adjusted R-squared:  0.1315 
## F-statistic: 31.19 on 10 and 1983 DF,  p-value: < 2.2e-16

From the previous correlation plot, potential multicollinearity is observed. We will check individual VIF values:

vif_values <- vif(lm(Popularity~Year+Bpm+Energy+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_mc))

vif <- data.frame(variables = names(vif_values), vif_value = as.numeric(vif_values)) %>% 
  arrange(desc(vif_value))
knitr::kable(vif) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  row_spec(1, bold = T, color = "white", background = "#D7261E")
variables vif_value
Energy 4.144251
Loudness 2.668141
Valence 2.007379
Acousticness 1.878791
Danceability 1.565367
Year 1.275841
Length 1.123852
Bpm 1.076494
Speechiness 1.075458
Liveness 1.070967

Energy has a VIF > 4, indicating multicollinearity. We check which variables are associated with Energy:

summary(lm(Energy~Year+Bpm+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_mc))
## 
## Call:
## lm(formula = Energy ~ Year + Bpm + Danceability + Loudness + 
##     Liveness + Valence + Length + Acousticness + Speechiness, 
##     data = spotify_mc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.589  -7.230  -0.144   7.243  33.618 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  121.084515  33.682609   3.595 0.000332 ***
## Year          -0.057188   0.017076  -3.349 0.000826 ***
## Bpm            0.025237   0.009027   2.796 0.005226 ** 
## Danceability  -0.078886   0.019833  -3.977 7.22e-05 ***
## Loudness       3.272998   0.081053  40.381  < 2e-16 ***
## Liveness       0.104147   0.014934   6.974 4.18e-12 ***
## Valence        0.236554   0.012873  18.376  < 2e-16 ***
## Length         0.023902   0.003856   6.198 6.93e-10 ***
## Acousticness  -0.264910   0.009893 -26.777  < 2e-16 ***
## Speechiness    0.387968   0.056902   6.818 1.22e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.91 on 1984 degrees of freedom
## Multiple R-squared:  0.7587, Adjusted R-squared:  0.7576 
## F-statistic: 693.1 on 9 and 1984 DF,  p-value: < 2.2e-16

Almost all variables are related to Energy. Therefore, we remove Energy to examine the effect on multicollinearity of other factors:
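The Energy VIF in the table follows directly from this auxiliary regression via the identity VIF = 1 / (1 − R²), using the Multiple R-squared of 0.7587 reported above:

```r
# VIF of Energy from the auxiliary regression's R-squared
r_squared  <- 0.7587
vif_energy <- 1 / (1 - r_squared)
vif_energy   # ~4.14, matching the VIF table
```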

vif_values1 <- vif(lm(Popularity~Year+Bpm+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_mc))
vif1 <- data.frame(variables = names(vif_values1), vif_value = as.numeric(vif_values1)) %>% 
  arrange(desc(vif_value))
knitr::kable(vif1) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
variables vif_value
Valence 1.715415
Danceability 1.552984
Loudness 1.464489
Acousticness 1.380043
Year 1.268668
Length 1.102504
Bpm 1.072269
Speechiness 1.050836
Liveness 1.045341

All VIF values are now below 4, so multicollinearity should not distort the subsequent regression analysis.

Principal Component Analysis (PCA) to Explain Popularity Variance

PCA without Energy:

spotify_test <- spotify_mc[,-c(3,11)]
pca_spotify<-prcomp(spotify_test,scale. = TRUE)
pca_spotify
## Standard deviations (1, .., p=9):
## [1] 1.3813552 1.2300727 1.0576690 1.0198534 0.9887333 0.9598953 0.8464142
## [8] 0.6841995 0.5800653
## 
## Rotation (n x k) = (9 x 9):
##                     PC1         PC2          PC3         PC4         PC5
## Year         -0.2238406 -0.40828118  0.512756050 -0.32345368  0.13113839
## Bpm          -0.1115834 -0.23095177 -0.565134582  0.19366987  0.59958954
## Danceability -0.3990834  0.46759173  0.270236804  0.09894569 -0.13453300
## Loudness     -0.4840906 -0.37429374  0.126137593 -0.08400684  0.07613662
## Liveness     -0.0905829 -0.17276110 -0.449740455 -0.46047232 -0.56840114
## Valence      -0.4628039  0.46949379 -0.190431966  0.11180372 -0.01035366
## Length        0.1113730 -0.33015937  0.002742636  0.63348604 -0.50048917
## Acousticness  0.4882861  0.24802664  0.024352221 -0.37726925  0.08891998
## Speechiness  -0.2657316 -0.02089687 -0.299389631 -0.26990890 -0.13347645
##                      PC6        PC7          PC8         PC9
## Year          0.14526643 -0.4648746 -0.172122773 -0.36280373
## Bpm           0.07351902 -0.4333267  0.003888681  0.15692517
## Danceability  0.17496027 -0.4048524 -0.034899680  0.57047218
## Loudness     -0.19917662  0.2885415  0.642886482  0.24610027
## Liveness     -0.33256940 -0.3049853 -0.084531346  0.11748738
## Valence      -0.18116459 -0.1112328  0.220705729 -0.65008630
## Length        0.24525793 -0.2683873  0.279089579 -0.12928721
## Acousticness  0.18367211 -0.3056474  0.649351531 -0.02035608
## Speechiness   0.81602714  0.2794944 -0.020800028 -0.06480015
summary(pca_spotify)
## Importance of components:
##                          PC1    PC2    PC3    PC4    PC5    PC6    PC7     PC8
## Standard deviation     1.381 1.2301 1.0577 1.0199 0.9887 0.9599 0.8464 0.68420
## Proportion of Variance 0.212 0.1681 0.1243 0.1156 0.1086 0.1024 0.0796 0.05201
## Cumulative Proportion  0.212 0.3801 0.5044 0.6200 0.7286 0.8310 0.9106 0.96261
##                            PC9
## Standard deviation     0.58007
## Proportion of Variance 0.03739
## Cumulative Proportion  1.00000

PC1–PC3 together explain just over half of the variance (cumulative proportion 0.504). To judge how many components stand above chance, we compare their eigenvalues against 100 simulated noise datasets of the same size (parallel analysis):

sim_noise_ev <- function(n, p) {
  noise <- data.frame(replicate(p, rnorm(n))) 
  eigen(cor(noise))$values
}
evalues_noise <- replicate(100, sim_noise_ev(nrow(spotify_test), ncol(spotify_test)))
evalues_mean <- apply(evalues_noise, 1, mean)
screeplot(pca_spotify,type="line")
abline(h=1,col="grey",lty="dotted")
lines(evalues_mean, type="b",col="blue")

Compared with the simulated noise eigenvalues (blue line), PC1, PC2, and PC3 stand clearly above noise and are the most representative components.

PCA Loadings

top4.pca.eigenvector <- pca_spotify$rotation[,1:4]
first.pca <- top4.pca.eigenvector[, 1] 
second.pca <- top4.pca.eigenvector[, 2] 
third.pca <- top4.pca.eigenvector[, 3] 

PC1

dotchart(first.pca[order(first.pca, decreasing=FALSE)] ,   
    main="Loading Plot for PC1",xlab="Variable Loadings", col="red")

PC1 has its largest positive loading on Acousticness, with strong negative loadings on Loudness, Valence, and Danceability; Liveness and Length are near zero.
first.pca
##         Year          Bpm Danceability     Loudness     Liveness      Valence 
##   -0.2238406   -0.1115834   -0.3990834   -0.4840906   -0.0905829   -0.4628039 
##       Length Acousticness  Speechiness 
##    0.1113730    0.4882861   -0.2657316
We define PC1 as:
PC1 = -0.2238*Year -0.1116*BPM -0.3991*Danceability -0.4841*Loudness -0.0906*Liveness -0.4628*Valence +0.1114*Length +0.4883*Acousticness -0.2657*Speechiness
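A PC score is just this weighted sum applied to the standardized features; the identity can be checked with prcomp on toy data (not the actual dataset):

```r
# Reproduce prcomp scores by hand: scores = standardized data %*% rotation
x      <- data.frame(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3))
p      <- prcomp(x, scale. = TRUE)
manual <- scale(x) %*% p$rotation   # weighted sums, as in the PC1 formula above
all.equal(manual, p$x)              # TRUE
```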

PC2

dotchart(second.pca[order(second.pca, decreasing=FALSE)] ,   
    main="Loading Plot for PC2",xlab="Variable Loadings", col="red")

In the printed rotation, PC2 loads positively on Danceability, Valence, and Acousticness, and negatively on Year, Loudness, and Length.
second.pca
##         Year          Bpm Danceability     Loudness     Liveness      Valence 
##  -0.40828118  -0.23095177   0.46759173  -0.37429374  -0.17276110   0.46949379 
##       Length Acousticness  Speechiness 
##  -0.33015937   0.24802664  -0.02089687
We define PC2 as:
PC2 = -0.4083*Year -0.2310*BPM +0.4676*Danceability -0.3743*Loudness -0.1728*Liveness +0.4695*Valence -0.3302*Length +0.2480*Acousticness -0.0209*Speechiness

PC3

dotchart(third.pca[order(third.pca, decreasing=FALSE)] ,   
    main="Loading Plot for PC3",xlab="Variable Loadings", col="red")

PC3 has its largest positive loading on Year and strong negative loadings on BPM and Liveness; Length and Acousticness are near zero.
third.pca
##         Year          Bpm Danceability     Loudness     Liveness      Valence 
##  0.512756050 -0.565134582  0.270236804  0.126137593 -0.449740455 -0.190431966 
##       Length Acousticness  Speechiness 
##  0.002742636  0.024352221 -0.299389631
PC3 defined as:
PC3 = 0.5128*Year -0.5651*BPM +0.2702*Danceability +0.1261*Loudness -0.4497*Liveness -0.1904*Valence +0.0027*Length +0.0244*Acousticness -0.2994*Speechiness

Build Regression Model

Regression Using PCA Scores

pre <- predict(pca_spotify)
spotify_test$pc1 <- pre[,1]
spotify_test$pc2 <- pre[,2]
spotify_test$pc3 <- pre[,3]
spotify_test$Popularity <- spotify_mc$Popularity
lm.sol<-lm(Popularity~pc1+pc2+pc3, data=spotify_test)
summary(lm.sol)
## 
## Call:
## lm(formula = Popularity ~ pc1 + pc2 + pc3, data = spotify_test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.565  -9.202   2.189  10.700  38.435 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.5266     0.3161 188.329  < 2e-16 ***
## pc1          -1.5989     0.2289  -6.986 3.85e-12 ***
## pc2           1.1723     0.2570   4.561 5.40e-06 ***
## pc3          -0.3002     0.2989  -1.004    0.315    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.11 on 1990 degrees of freedom
## Multiple R-squared:  0.03427,    Adjusted R-squared:  0.03281 
## F-statistic: 23.54 on 3 and 1990 DF,  p-value: 5.719e-15

Popularity is significantly associated with PC1 and PC2 (p < 0.001), while PC3 is not significant.

coef<-coef(lm.sol)
coef
## (Intercept)         pc1         pc2         pc3 
##  59.5265797  -1.5988774   1.1722783  -0.3002102

Linear regression model: Popularity = 59.5266 - 1.5989*PC1 + 1.1723*PC2 - 0.3002*PC3
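As a quick sanity check, plugging hypothetical PC scores into the fitted equation gives a predicted popularity (the scores here are made up for illustration; coefficients are from the coef output above):

```r
# Hypothetical PC scores; coefficients from the fitted model above
pc1 <- -1; pc2 <- 0.5; pc3 <- 0
popularity_hat <- 59.5266 - 1.5989 * pc1 + 1.1723 * pc2 - 0.3002 * pc3
popularity_hat   # 61.71165
```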

K-Fold Cross Validation

k_fold_mse <- function( k, model,outcome,dataset) { 
  shuffled_indicies <- sample(1:nrow(dataset))
  dataset <- dataset[shuffled_indicies,]
  
  fold_pred_errors <- sapply(1:k, \(i) {
    fold_i_pe(i, k, dataset, model, outcome)
  })
  pred_errors <- unlist(fold_pred_errors) 
  mse<-\(errs) mean(errs^2)
  c(is=mse(residuals(model)),oos=mse(pred_errors))
}

fold_i_pe <- function(i, k, dataset, model,outcome) { 
  folds <- cut(1:nrow(dataset),k, labels = FALSE)
  test_indices <- which(folds==i) 
  test_set <- dataset[test_indices,] 
  train_set <- dataset[-test_indices,] 
  trained_model <- update(model,data=train_set)
  predictions <- predict(trained_model, test_set)
  dataset[test_indices,outcome] - predictions 
}
pc_mse<-k_fold_mse(10,lm.sol,"Popularity",spotify_test)
pc_mse
##       is      oos 
## 198.8105 199.4796

In-sample MSE: 198.8105; out-of-sample MSE: 199.4796. The small gap suggests the PC regression does not overfit.

3. Decision Tree Model

Build Decision Tree

spotify_tree <- rpart(Popularity~Year+Bpm+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_test)
rpart.plot(spotify_tree)

Decision Tree Model MSE

mse_oos <- function(actual, predicted) mean((actual - predicted)^2)  # squared-error helper (not defined earlier in the report)
mse_oos_decision <- mse_oos(spotify_test$Popularity, predict(spotify_tree, spotify_test))
mse_oos_decision
## [1] 180.535

Note that the tree is evaluated on its own training data here, so despite the name this is effectively an in-sample error.

4. Random Forest Model

Build Random Forest

set.seed(1)
randomforest_spotify <- randomForest(Popularity~Year+Bpm+Danceability+Loudness+Liveness+Valence+Length+Acousticness+Speechiness,data=spotify_test,importance=TRUE,proximity=TRUE,do.trace=100)
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  100 |    177.8    86.35 |
##  200 |    175.5    85.24 |
##  300 |    175.3    85.17 |
##  400 |      175    85.01 |
##  500 |    174.2    84.61 |

plot(randomforest_spotify)

The error plot shows the out-of-bag error decreasing as trees are added, stabilizing by 500 trees.

varImpPlot(randomforest_spotify)

The variable importance plot shows that Year, Length, and Danceability are the most important predictors.

Random Forest MSE

The final out-of-bag MSE after 500 trees is 174.2, as shown in the training trace above.

5. Comparison of Three Models

We compare the MSE of PCA, Decision Tree, and Random Forest models:

mse_compare <- data.frame(model=c("PCA test","Decision tree","Random forest"),mse_oos = as.numeric(c(199.4796,180.535,174.2)))
mse_compare %>% arrange(-mse_oos)%>%
  ggplot(aes(x=model,y=mse_oos,fill=model))+
  geom_bar(stat="identity", width=0.5)+
  scale_fill_brewer(palette="Greens")+
  geom_text(aes(label = mse_oos), vjust = 2)

Random Forest achieves the lowest MSE, so we choose it for predictions.
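The model choice above is just an arg-min over the three out-of-sample MSEs reported earlier:

```r
# Pick the model with the smallest out-of-sample MSE (values reported earlier)
mse <- c("PCA test" = 199.4796, "Decision tree" = 180.535, "Random forest" = 174.2)
names(which.min(mse))   # "Random forest"
```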

6. Demo: New Data

We test the model with songs from 2019 using Random Forest to predict popularity:

new_song <- read.csv("New song.csv",header=T)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'New song.csv'
names(new_song)<-c("Title","Artist","Year","Genre","Bpm", "Energy", "Danceability","Loudness","Liveness","Valence","Length","Acousticness","Speechiness","Popularity")

new_song$Length = as.integer(as.factor(new_song$Length))
new_song

Predict popularity using Random Forest:

predicted_data <- predict(randomforest_spotify, newdata=new_song)
predicted_data
##        1 
## 65.96297

Compare predicted vs actual values:

Demo Results

Popularity <- c("predicted","real")
Value <- c(round(predicted_data,3),new_song$Popularity)
compare <- data.frame(Popularity,Value)
compare %>% ggplot(aes(x=Popularity,y=Value,fill=Popularity))+
  geom_bar(stat="identity", position=position_dodge(),width=0.5)+
  scale_fill_brewer(palette="Greens")+
  geom_text(aes(label = Value), vjust = 2)

Prediction: 65.963 vs actual: 75, difference = 9.037 (~12.05% of actual).
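The percentage difference quoted above is simple arithmetic:

```r
predicted <- 65.963
actual    <- 75
diff_abs  <- actual - predicted        # 9.037
diff_pct  <- diff_abs / actual * 100   # ~12.05 percent of the actual value
round(diff_pct, 2)
```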

7. Shiny Interactive Webpage

We designed a Shiny app to allow users to input features and get predicted popularity:

https://samuel0731.shinyapps.io/spotify_shiny_website/

knitr::include_graphics("demo1.png")

The top panel shows historical song data; in the bottom-left panel, the user inputs the song name, artist, and features:

knitr::include_graphics("demo2.png")

The right panel shows the predicted popularity and a comparison with other songs.

Conclusion

This study analyzed the evolution of music genres over decades using visualizations. Using PCA, Decision Tree, and Random Forest, we found Random Forest yields the lowest MSE, making it the best prediction model. Finally, we implemented a Shiny interactive platform for user-friendly predictions.